1. INTRODUCTION

Data analysis and machine learning have become essential components of modern scientific
methodologies, enabling automated techniques for predicting a phenomenon from prior observations,
uncovering underlying patterns in data, and providing insight into the problem. Random forest is one of the
most widely used ensemble algorithms in machine learning. The splitting criterion in a random forest is
obtained predominantly from the Gini index (GI) or information gain (IG); because the Gini index has an edge
over information gain, it is the more widely used of the two. The newly proposed Jenesis index aims to
overcome the lacuna in the Gini index. This paper studies the accuracy gained by the Jenesis index over the
Gini index on a dataset of myocardial infarctions.

The complexity of real-life problems lies in testing and relating different data mining techniques and
recognizing patterns with multiple techniques. The database discussed in this study concerns myocardial
infarction (MI), commonly known as heart attack. Among its various symptoms, the predominant ones are
chest pain, discomfort in the shoulder, numbness, and palpitation. MI can be identified by changes on an
electrocardiogram (ECG), a change in the ST segment, pathological Q waves, a change in heart wall motion, or
autopsy. Predicting the severity of this complication from the database is the need of the hour to avoid
fatality. The mortality rate is proportional to the acuteness of the myocardial infarction, so it is a
quintessential problem to be addressed in today’s world. In India, around 54.5 million people are prone to
cardiovascular disease. It is more prevalent in developed countries owing to poor diet and stress. MI
has a varying effect: patients with acute disorders are vulnerable to frequent illnesses or even fatality. Even
experienced physicians cannot foresee the complications from the outset, hence predicting the complications
is necessary to prevent this disease.
2. PROPOSED METHOD - JENESIS INDEX
In this proposed method, the demerits of both GI and IG are emended with the Jenesis index. Let $A_{ij}$
denote the $i$-th record (out of a total of $n$ sample records of the test set) of the $j$-th sample attribute
(out of a total of $m$ attributes of the test set), converted into numerical values. Let $S_R$ be the number of
sample rows and $S_C$ be the number of sample columns. Let $n_{ij}$ denote the number of elements of a class
(denoted by $C$) of the $i$-th row’s $j$-th element in $S_C$. Let $T_i$ denote the number of $n_i$’s in the
outcome column showing YES or 1. The data contained in the table may be ordinal, nominal, or real. Therefore,
Algorithm 1 is applied to columns that have ordinal or nominal values and Algorithm 2 is applied to columns
that have real values.


Algorithm 1. Algorithm for columns with ordinal/nominal values
Input: Train set
Output: Probable node with Jenesis index
1. Find all the unique values from the features
2. Calculate the ratios of 0’s and 1’s in the target column using
   a. tn(0) = total 0’s / total target length
   b. tn(1) = total 1’s / total target length
3. For each feature value
   a. Find the number of occurrences (n)
   b. Find the number of 0’s and 1’s in the corresponding target column (t).
   c. Calculate the ratio of target occurrence (v) using the function
      v(0) = t(0) / n and v(1) = t(1) / n.
   d. Find the summation (u) of v(0) and v(1) for all unique features.
   e. Calculate the ratio of v(0) and v(1) with
      p(0) = v(0) / u and p(1) = v(1) / u
   f. Calculate the probability of 0 and 1 using
      (i) probability(0) = p(0) * tn(0)
      (ii) probability(1) = p(1) * tn(1)
   g. Calculate the final probability of the feature value with probability(0) +
      probability(1).
4. The feature value with the highest probability will be the split value for the feature
   and the corresponding probability will be the split probability for the feature.
5. The feature with the highest probability will be the split feature and the corresponding
   probability will be the split probability for the data set.
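
Steps 1 to 4 of Algorithm 1 can be sketched for a single column as follows. This is only an illustrative
reading, not the authors' implementation: the function name `jenesis_nominal` and the sample data are
hypothetical, and step 3d is interpreted (an assumption) as one value u that sums v(0) + v(1) over all unique
values of the column.

```python
def jenesis_nominal(feature, target):
    """Return (split_value, split_probability, all_scores) for one nominal column."""
    total = len(target)
    # Step 2: class ratios in the target column.
    tn0 = target.count(0) / total
    tn1 = target.count(1) / total

    # Steps 3a-3c: per-value target ratios v(0) and v(1).
    ratios = {}
    for value in dict.fromkeys(feature):  # unique values, original order
        labels = [y for x, y in zip(feature, target) if x == value]
        n = len(labels)
        ratios[value] = (labels.count(0) / n, labels.count(1) / n)

    # Step 3d (assumed reading): u sums v(0) + v(1) over all unique values.
    u = sum(v0 + v1 for v0, v1 in ratios.values())

    # Steps 3e-3g: normalise by u, weight by the class ratios, and add.
    scores = {
        value: (v0 / u) * tn0 + (v1 / u) * tn1
        for value, (v0, v1) in ratios.items()
    }

    # Step 4: the value with the highest probability is the split value.
    split_value = max(scores, key=scores.get)
    return split_value, scores[split_value], scores


# Hypothetical toy column: six records, binary target.
feature = ["a", "a", "b", "b", "b", "c"]
target = [0, 1, 0, 0, 0, 1]
split_value, split_probability, scores = jenesis_nominal(feature, target)
```

Step 5 would repeat this over every column and keep the column whose split probability is largest.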


Algorithm 2. Algorithm for columns with real values
Input: Train set
Output: Probable node with Jenesis index
1. Pick all the unique values from the features
2. Calculate the ratios of 0’s and 1’s in the target column using
   a. tn(0) = total 0’s / total target length
   b. tn(1) = total 1’s / total target length
3. For each feature value
   a. Find the number of occurrences (n)
   b. Find the number of 0’s and 1’s in the corresponding target column, t(0) and t(1).
   c. Calculate the final probability of the feature value with
      ((t(0) / n) * tn(0) + (t(1) / n) * tn(1)) * 100.
4. The feature value with the highest probability will be the split value for the feature
   and the corresponding probability will be the split probability for the feature.
5. The feature with the highest probability will be the split feature and the corresponding
   probability will be the split probability for the data set.
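
A minimal sketch of steps 1 to 4 of Algorithm 2 for one real-valued column is given below. The function name
`jenesis_real` and the sample values are hypothetical and serve only to illustrate the formula in step 3c; the
sketch treats each distinct real value as its own candidate, as the algorithm text does.

```python
def jenesis_real(feature, target):
    """Return (split_value, split_probability, all_scores) for one real-valued column."""
    total = len(target)
    # Step 2: class ratios in the target column.
    tn0 = target.count(0) / total
    tn1 = target.count(1) / total

    scores = {}
    for value in dict.fromkeys(feature):  # Step 1: unique values
        labels = [y for x, y in zip(feature, target) if x == value]
        n = len(labels)                       # Step 3a
        t0, t1 = labels.count(0), labels.count(1)  # Step 3b
        # Step 3c: ((t(0)/n) * tn(0) + (t(1)/n) * tn(1)) * 100
        scores[value] = ((t0 / n) * tn0 + (t1 / n) * tn1) * 100

    # Step 4: the value with the highest probability is the split value.
    split_value = max(scores, key=scores.get)
    return split_value, scores[split_value], scores


# Hypothetical toy column: six records, binary target.
real_feature = [1.5, 1.5, 2.0, 2.0, 2.0, 3.2]
real_target = [0, 1, 0, 0, 0, 1]
real_split, real_probability, real_scores = jenesis_real(real_feature, real_target)
```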


The architectural diagram in Figure 1 is a diagrammatic representation of the Jenesis
algorithm. The dataset was split into five folds, with four folds for the training and validation set and one
fold for the test set. The mean accuracy was calculated from the scores computed for each fold. A confusion
matrix, shown in Table 1, is used to identify the number of true positives, true negatives, false positives, and
false negatives.
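
The five-fold split can be sketched as below. This is an assumption-laden illustration: the helper names
`k_fold_indices` and `mean_accuracy` are hypothetical, the folds are taken as contiguous blocks without
shuffling, and the paper does not state how its folds were actually drawn.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one row."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


def mean_accuracy(fold_scores):
    """Average the accuracy scores computed on the individual folds."""
    return sum(fold_scores) / len(fold_scores)


# 1700 rows is the dataset size reported in this paper.
folds = list(k_fold_indices(1700, k=5))
```

With 1700 rows, each test fold holds 340 rows, which equals the 65 + 275 instances listed in Table 2.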

The confusion matrix in Table 2 gave more insight into the accuracy of the predicted results. The
results obtained from ORF-Jenesis were compared with those obtained from RF-Gini, and the confusion
matrices of RF-Gini and ORF-Jenesis observed in the analysis of myocardial infarctions are both listed in
Table 2. The aim was to focus on predicting true positives and true negatives and on minimizing false
negatives and false positives.
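
The four cells of a confusion matrix can be counted from actual and predicted labels as in the following
sketch. The function name `confusion_counts` and the toy labels are hypothetical; the sketch assumes a binary
target coded as 0 and 1.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, TN, FP for binary labels coded as 0 (negative) and 1 (positive)."""
    counts = {"TP": 0, "FN": 0, "TN": 0, "FP": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 1 and p == 0:
            counts["FN"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        else:  # actual 0, predicted 1
            counts["FP"] += 1
    return counts


# Hypothetical toy labels.
actual = [1, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 1]
cells = confusion_counts(actual, predicted)
```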
Figure 1. Architecture diagram


Table 1. Confusion matrix
                                   Actual class
                                   Positive (1)      Negative (0)
Predicted class    Positive (1)    True positive     False positive
                   Negative (0)    False negative    True negative


Table 2. Confusion matrix for myocardial infarctions
Function       Accuracy (%)   Actual total   Actual total   True        False       True        False
                              positives      negatives      positives   negatives   negatives   positives
RF-Gini        80.58          65             275            2           63          272         3
ORF-Jenesis    81.47          65             275            4           61          273         2


Table 3. Time complexity and space complexity
Function       No. of trees   Time taken to complete execution   Space complexity
RF-Gini        1              76                                 5.3 Mb
               5              383
               10             730
               20             1727
ORF-Jenesis    1              11                                 8 Mb
               5              45
               10             101
               20             185
3. METHOD

In Breiman’s random forest algorithm [1], decision trees play a pivotal role in deciding the node at which to
split the tree and, in turn, in creating decision trees from the test set to predict the percentage of trueness.
While these approaches have been shown to be reliable, accurate, and useful tools for a wide range of machine
learning problems, such as classification, regression, density estimation, manifold learning, and
semi-supervised learning, we still have a lot to learn about them. G. Louppe studied the induction of decision
trees and the construction of ensembles of randomized trees, showing their good computational performance and
scalability, along with an in-depth discussion of their implementation details as contributed within
Scikit-Learn. He also analysed the interpretability of random forests through variable importance measures.
The core of his contribution rests in the theoretical characterization of the mean decrease of impurity
variable importance measure, from which some of its properties are derived in the case of multiway totally
randomized trees and in asymptotic conditions [2].

Saffari et al. [3] combined ideas from on-line bagging and extremely randomized forests and
proposed an on-line decision tree growing procedure, together with a temporal weighting scheme for
adaptively discarding some trees based on their out-of-bag error in given time intervals and consequently
growing new trees. Kalidas and Tamil [4] proposed a method of atrial fibrillation (AF) detection that combines
Markov models and random forests, achieves high accuracy across multiple databases, and demonstrates
comparable or superior performance to several other state-of-the-art algorithms. Kaur et al. [5] discuss the
use of a random forest classifier to detect atrial fibrillation with 10-fold cross-validation. Gradient boosting
was used to predict the likelihood of acute myocardial infarction in a study by Than et al. [6]. Yadav and
Pal [7] combined Pearson correlation and lasso regularization with random forest to achieve 99% accuracy
on the heart disease dataset. Belhadj et al. [8] proposed a fuzzy version of the Gini index to improve its
performance. A diverse range of research studies has been conducted on coronary illness.
One such work is the prediction of heart disease using a neural classifier by
Mathan et al. [9]. To improve the accuracy of the classifier, several combinations of techniques, such
as fuzzy logic and weighting of decision trees, have been used [10].

Jain et al. [11] investigated joint splitting criteria using two of the most used criteria, i.e.,
information gain and the Gini index, and proposed splitting the data at points where information gain is
maximum and the Gini index is minimum. Kulkarni et al. [12] proposed a method that generates each individual
decision tree in the random forest by randomly selecting one of three split measures: IG, GI, and gain ratio.
Raileanu and Stoffel [13] carried out a theoretical comparison of the two most popular split criteria, namely
GI and IG.

The general approach to prediction with random forests is: Decision trees -> Data set -> Training data
set -> Formation of rules -> Test set -> Classification -> Result.

Biau and Scornet placed emphasis on the mathematical forces driving the algorithm, with
special attention given to the selection of parameters, the resampling mechanism, and variable importance
measures [14]. Information gain is another well-known technique for classifiers, but it is
seldom used because it is complicated and its results are biased in unbalanced trees. Antonin Leroux suggests a
method to improve the prediction accuracy using IG [15].

Kirmani and Ansarullah [16] studied the application of data mining techniques to different patterns of
heart disease. Lempitsky et al. [17] showed that random forests, which are
discriminative classifiers developed recently in the machine learning field, allow for accurate delineations of
the full 3D volume in a matter of seconds (on a CPU) or even in real time (on a GPU). Yosefian et al. [18]
determined the applicability of saturated tree (ST), pruned tree (PT), and random survival forest (RSF)
methods. Khened et al. [19] proposed a fully automatic method for segmentation of the left ventricle, right
ventricle, and myocardium from cardiac magnetic resonance (MR) images using a densely connected fully
convolutional neural network. The dense convolutional neural network (DenseNet) facilitates multi-path flow
for gradients between layers during training by back-propagation and feature propagation using random forest.
Allen et al. [20] studied the development of methods for precise quantification, which is critical for
improving myocardial infarction patient diagnosis and therapy. Mansoor et al. [21] used logistic regression and
random forest to design and evaluate prediction models for all-cause in-hospital mortality in women
hospitalised with STEMI, and compared the performance and validity of the different models. Bodenhofer et al.
[22] examined the efficacy of contemporary machine learning algorithms in individualised risk prediction for
patients undergoing elective heart valve surgery; correct anticipation of this risk allows for improved
counselling of patients and avoidance of possible complications. Asadi et al. [23] investigated the
effectiveness of their proposed method by comparing its performance over six heart datasets with individual
and ensemble classifiers. The results suggest that the proposed method with the
(near) optimal number of classifiers outperforms the random forest algorithm with different classifiers. The
dataset on myocardial infarctions is obtained from the UCI repository; it has 1700 rows and 124 columns
containing test and lab reports of patients’ data [24]. Zahibi et al. [25] proposed a method in which
characteristics from the time, frequency, and time-frequency domains of the ECG signals, and from phase space
reconstruction, are used. A random forest classifier is employed in the final stage to categorize the selected
characteristics into one of four ECG classes.


3.1. Random forest 

The random forest algorithm is a widely used supervised machine learning algorithm that gives
accurate predictions. The core idea behind a random forest implementation is dividing the training set into
sub-samples of data and constructing multiple trees. The Gini index and information gain are the techniques
used as attribute selection measures, which in turn determine the splitting of the nodes while constructing the
tree. The leaf nodes are then analysed using bootstrap aggregation, commonly called bagging, to predict the
outcome of a specific tuple. The primary concern of any algorithm rests on two concepts:
- Versatility in accommodating data with various factors, which leads to trueness in prediction.
- Enhancing the accuracy of the result of an algorithm.
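
The two ingredients of bagging described above, drawing sub-samples and combining the per-tree outcomes, can
be sketched as follows. The helper names `bootstrap_sample` and `majority_vote` and the toy rows are
hypothetical, and the tree-building step itself is omitted; this is a sketch of the general idea, not of the
implementation used in this study.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one sub-sample per tree)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]  # hypothetical (feature, target) rows
sample = bootstrap_sample(rows, rng)
vote = majority_vote([1, 0, 1])  # three trees voting on one tuple
```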


3.1.1. Random forest: existing attribute selection measures
The nodal points of the decision trees and the formation of rules are determined primarily by the
Gini index or information gain. The main features of the Gini index and information gain are:


Gini index:

(a) $\text{Gini index} = 1 - \sum_{i=1}^{n} p_i^2$, where $p_i$ is the proportion of records belonging to class $i$.

(b) If the Gini index is zero, all records pertain to only one class. The index is largest, $1 - 1/n$, when
the records are spread equally across the $n$ classes.

(c) In a decision tree, if the value is less than 0.5, we accept it and proceed to the next classification.

Merits of Gini index:

(i) It is used for large partitions.

(ii) It is useful in inequality measures.

(iii) It is easy to implement.

Demerits of Gini index:

(i) Not compatible with more distinct values.

(ii) The measure will give different results when applied to different sets.
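
As a small illustration of the Gini index computed from class proportions, the sketch below evaluates it for a
pure node, an evenly split node, and the class balance reported in Table 2 (65 positives, 275 negatives). The
function name `gini_index` is hypothetical.

```python
def gini_index(labels):
    """Gini index 1 - sum(p_i^2) over the class proportions p_i in labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


pure = gini_index([1, 1, 1, 1])            # a single class
even = gini_index([0, 1, 0, 1])            # two classes, equally spread
table2 = gini_index([1] * 65 + [0] * 275)  # class balance from Table 2, about 0.309
```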


Information Gain:

Information gain is used for smaller partitions and more distinct values, and it is comparatively harder to
implement than the Gini index.

(i) $\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2 p_i$

(ii) $\text{Information Gain}(\text{Target}, \text{Predictor}) = \text{Entropy}(\text{Target}) - \text{Entropy}(\text{Target} \mid \text{Predictor})$,
where the second term is the weighted entropy of the target after splitting on the predictor.

(iii) Choose the largest information gain as the split to proceed to the next step.

Merits of Information Gain:

(i) It is used for more distinct values.

(ii) It is a good measure for deciding the relevance of the attributes.

Demerits of Information Gain:

(i) Not compatible with large partitions.

(ii) It is hard to implement.
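
Entropy and information gain can be sketched in a few lines. The function names `entropy` and
`information_gain` and the toy columns are hypothetical; the second term of the gain is taken, as is standard,
to be the entropy of the target within each predictor value, weighted by how often that value occurs.

```python
import math


def entropy(labels):
    """Entropy -sum(p_i * log2(p_i)) over the class proportions in labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def information_gain(predictor, target):
    """Entropy of the target minus its weighted entropy after splitting on predictor."""
    groups = {}
    for x, y in zip(predictor, target):
        groups.setdefault(x, []).append(y)
    weighted = sum(len(g) / len(target) * entropy(g) for g in groups.values())
    return entropy(target) - weighted


# Hypothetical toy columns: the first predictor separates the classes, the second does not.
gain_informative = information_gain(["a", "a", "b", "b"], [0, 0, 1, 1])
gain_uninformative = information_gain(["a", "b", "a", "b"], [0, 0, 1, 1])
```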


4. RESULTS AND DISCUSSION

The results obtained from the model showed that ORF-Jenesis performed better than
RF-Gini. The model was first trained on 1000 rows and 100 columns of the dataset. Most healthcare datasets
contain more negatives than positives, and the dataset in question is no different: it has 275
negative instances and 65 positive instances. The accuracy achieved by ORF-Jenesis is
81.47% and the accuracy of RF-Gini is 80.58%. The target column in the dataset contains more
0s, so the number of predicted negatives is higher than the number of predicted
positives. Of the 275 negative instances in the target column of the actual dataset,
273 were correctly predicted as true negatives and 2 were incorrectly predicted as false positives by
ORF-Jenesis, whereas RF-Gini correctly predicted 272 instances as true negatives. Of the 65
positive instances in the target column of the actual dataset, 4 were correctly
predicted as positive and 61 were incorrectly predicted as false negatives by ORF-Jenesis, whereas 63
were incorrectly predicted as false negatives by RF-Gini. The f-measure, sensitivity,
and specificity can be calculated from the confusion matrix. The accuracy of an algorithm depends not just on
the prediction of positives but on the prediction of negatives as well. ORF-Jenesis predicts better than
RF-Gini on the positive scale as well as the negative scale.
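
The measures named above can be derived from the Table 2 counts as in this sketch. The function name
`metrics` is hypothetical, and the formulas are the standard definitions of accuracy, sensitivity,
specificity, and f-measure rather than anything specific to this study.

```python
def metrics(tp, fn, tn, fp):
    """Standard measures derived from the four confusion-matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }


# Counts taken from Table 2.
table2 = {
    "RF-Gini": {"tp": 2, "fn": 63, "tn": 272, "fp": 3},
    "ORF-Jenesis": {"tp": 4, "fn": 61, "tn": 273, "fp": 2},
}
results = {name: metrics(**cells) for name, cells in table2.items()}
```

For ORF-Jenesis this gives an accuracy of 277/340 (81.47%), a sensitivity of 4/65 (about 0.06), and an
f-measure of 8/71 (about 0.11), which shows how strongly the class imbalance depresses the positive-class
measures.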

a) Time and space complexity: The performance of an algorithm is evaluated by means of its time and space
complexity. The time complexity and space complexity obtained from the analysis of the dataset on
myocardial infarctions are shown in Table 3.

b) Limitations of the proposed model: The result of the prediction depends purely on the balance of the
dataset. The f-measure cannot be considered a measure of accuracy in a dataset where the ratio of
positives to negatives is low. The dataset used contains more 0s than 1s; therefore, the f-measure
and sensitivity are very low.
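
For item (a), the execution times in Table 3 can be compared directly, as in the sketch below. The variable
names are hypothetical, and the time unit is not stated in the table, so only the ratio between the two
methods is computed.

```python
# Execution times from Table 3, keyed by number of trees: (RF-Gini, ORF-Jenesis).
timings = {1: (76, 11), 5: (383, 45), 10: (730, 101), 20: (1727, 185)}

# Ratio of RF-Gini time to ORF-Jenesis time for each forest size.
speedup = {trees: gini / jenesis for trees, (gini, jenesis) in timings.items()}

for trees, ratio in speedup.items():
    print(f"{trees:>2} trees: RF-Gini takes {ratio:.1f}x as long as ORF-Jenesis")
```

On these figures ORF-Jenesis completes in roughly one seventh to one ninth of the RF-Gini time, while
Table 3 lists a larger space requirement for it (8 Mb against 5.3 Mb).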


5. CONCLUSION 

In classification problems, the target attribute contains 0s and 1s. In classification problems that
involve medical data, the target attribute indicates whether a patient has a disease or not. In a general sense,
it is not sufficient to specify whether a patient has a disease or not; in cases such as cancer, it is
imperative to state the degree of infection. Therefore, it is inadequate to classify the target as just 0 and 1.
The values in the attributes could be modified to predict a real value that would signify the degree of illness.